On-Demand Indexing for Referential Compression of DNA Sequences
Abstract
The decreasing cost of genome sequencing is creating a demand for scalable storage and processing tools and techniques that can handle the large amounts of generated data. Referential compression is one of these techniques: it exploits the similarity between the DNA of organisms of the same or an evolutionarily close species to reduce the storage demands of genome sequences by up to 700 times. The general idea is to store in the compressed file only the differences between the sequence to be compressed and a well-known reference sequence. In this paper, we propose a method for improving the performance of referential compression by removing the most costly phase of the process, the complete indexing of the reference. Our approach, called On-Demand Indexing (ODI), compresses human chromosomes, on average, five to ten times faster than other state-of-the-art tools while achieving similar compression ratios.
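The general idea of storing only the differences against a reference can be illustrated with a toy sketch. This is not the ODI algorithm from the paper: it is a minimal greedy matcher written for illustration, and the function names, the entry format (`"M"` for a match, `"L"` for a literal base), and the k-mer length `K = 4` are all assumptions. Note that it builds the complete k-mer index of the reference up front, which is precisely the costly phase the abstract says ODI removes.

```python
from collections import defaultdict

K = 4  # toy k-mer length (assumption; practical tools use much larger values)


def build_index(reference, k=K):
    """Map every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def compress(target, reference, k=K):
    """Encode target as ("M", ref_pos, length) matches and ("L", base) literals."""
    index = build_index(reference, k)  # full index built up front
    out, i = [], 0
    while i < len(target):
        best_pos, best_len = -1, 0
        # Try every reference position sharing the current k-mer and
        # extend each candidate as far as the sequences agree.
        for pos in index.get(target[i:i + k], []):
            length = 0
            while (i + length < len(target)
                   and pos + length < len(reference)
                   and target[i + length] == reference[pos + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = pos, length
        if best_len >= k:
            out.append(("M", best_pos, best_len))
            i += best_len
        else:
            # No usable match: store the base itself.
            out.append(("L", target[i]))
            i += 1
    return out


def decompress(entries, reference):
    """Rebuild the target from the entries and the same reference."""
    parts = []
    for entry in entries:
        if entry[0] == "M":
            _, pos, length = entry
            parts.append(reference[pos:pos + length])
        else:
            parts.append(entry[1])
    return "".join(parts)
```

For a 16-base target that differs from its reference by a single substitution, this sketch emits two matches and one literal instead of 16 bases, and `decompress` recovers the target exactly as long as the same reference is available.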
Similar resources
Fast search in DNA sequence databases using punctuation and indexing
Exact pattern searching in DNA sequence databases has applications in identification of highly conserved regulatory sequences, the design of hybridization probes, and improving performance of approximate homology searching tools such as BLAST and BLAT. We propose a new pattern searching algorithm, CompressedPunctuated-Boyer-Moore (cp-BM), to enhance exact pattern match searches of DNA sequences...
High-speed and high-ratio referential genome compression
Motivation: The rapidly increasing number of genomes generated by high-throughput sequencing platforms and assembly algorithms is accompanied by problems in data storage, compression and communication. Traditional compression algorithms are unable to meet the demand of high compression ratio due to the intrinsic challenging features of DNA sequences such as small alphabet size, frequent repeats ...
Practical aspects of Compressed Suffix Arrays and FM-Index in Searching DNA Sequences
Searching patterns in the DNA sequence is an important step in biological research. To speed up the search process, one can index the DNA sequence. However, classical indexing data structures like suffix trees and suffix arrays are not feasible for indexing DNA sequences due to main memory requirement, as DNA sequences can be very long. In this paper, we evaluate the performance of two compress...
3D Models Recognition in Fourier Domain Using Compression of the Spherical Mesh up to the Models Surface
Representing 3D models in diverse fields has automatically paved the way for storing, indexing, classifying, and retrieving 3D objects. Classification and retrieval of 3D models demand that the models be represented in a way that captures the local and global shape specifications of the object. This requires establishing a 3D descriptor or signature that summarizes the pivotal shape properties of th...
DNA Lossless Differential Compression Algorithm based on Similarity of Genomic Sequence Database
Modern biological science produces vast amounts of genomic sequence data. This is fuelling the need for efficient algorithms for sequence compression and analysis. Data compression and the associated techniques coming from information theory are often perceived as being of interest for data communication and storage. In recent years, a substantial effort has been made for the application of tex...
Journal title:
Volume 10, Issue
Pages -
Publication date: 2015